{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/xx-langchain-chunking.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/xx-langchain-chunking.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### [LangChain Handbook](https://pinecone.io/learn/langchain)\n",
    "\n",
    "# Preparing Text Data for use with Retrieval-Augmented LLMs\n",
    "\n",
    "In this walkthrough we'll take a look at an example and some of the considerations when we need to prepare text data for retrieval augmented question-answering using **L**arge **L**anguage **M**odels (LLMs)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Required Libraries"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a few Python libraries we must `pip install` for this notebook to run, those are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qU langchain tiktoken matplotlib seaborn tqdm"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing Data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we will download the LangChain docs from [langchain.readthedocs.io/](https://langchain.readthedocs.io/latest/en/). We get all `.html` files located on the site like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This downloads all HTML into the `rtdocs` directory. Now we can use LangChain itself to process these docs. We do this using the `ReadTheDocsLoader` like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import ReadTheDocsLoader\n",
    "\n",
    "loader = ReadTheDocsLoader('rtdocs')\n",
    "docs = loader.load()\n",
    "len(docs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This leaves us with `389` processed doc pages. Let's take a look at the format each one contains:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We access the plaintext page content like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[5].page_content)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also find the source of each document:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[5].metadata['source'].replace('rtdocs/', 'https://')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good, we need to also consider the length of each page with respect to the number of tokens that will reasonably fit within the window of the latest LLMs. We will use `gpt-3.5-turbo` as an example.\n",
    "\n",
    "To count the number of tokens that `gpt-3.5-turbo` will use for some text we need to initialize the `tiktoken` tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "tokenizer = tiktoken.get_encoding('cl100k_base')\n",
    "\n",
    "# create the length function\n",
    "def tiktoken_len(text):\n",
    "    tokens = tokenizer.encode(\n",
    "        text,\n",
    "        disallowed_special=()\n",
    "    )\n",
    "    return len(tokens)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "Note that for the tokenizer we defined the encoder as `\"cl100k_base\"`. This is a specific tiktoken encoder which is used by `gpt-3.5-turbo`, as well as `gpt-4`, and `text-embedding-ada-002` which are models supported by OpenAI at the time of this writing. Other encoders may be available, but are used with models that are now deprecated by OpenAI.\n",
                "\n",
     "You can find more details in the [Tiktoken `model.py` script](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py), or using `tiktoken.encoding_for_model`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tiktoken.encoding_for_model('gpt-3.5-turbo')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `tiktoken_len` function, let's count and visualize the number of tokens across our webpages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "token_counts = [tiktoken_len(doc.page_content) for doc in docs]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see `min`, average, and `max` values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"\"\"Min: {min(token_counts)}\n",
    "Avg: {int(sum(token_counts) / len(token_counts))}\n",
    "Max: {max(token_counts)}\"\"\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now visualize:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# set style and color palette for the plot\n",
    "sns.set_style(\"whitegrid\")\n",
    "sns.set_palette(\"muted\")\n",
    "\n",
    "# create histogram\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.histplot(token_counts, kde=False, bins=50)\n",
    "\n",
    "# customize the plot info\n",
    "plt.title(\"Token Counts Histogram\")\n",
    "plt.xlabel(\"Token Count\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vast majority of pages seem to contain a lower number of tokens. But our limits for the number of tokens to add to each chunk is actually smaller than some of the smaller pages. But, how do we decide what this number should be?"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chunking the Text\n",
    "\n",
    "At the time of writing, `gpt-3.5-turbo` supports a context window of 4096 tokens — that means that input tokens + generated ( / completion) output tokens, cannot total more than 4096 without hitting an error.\n",
    "\n",
    "So we 100% need to keep below this. If we assume a very safe margin of ~2000 tokens for the input prompt into `gpt-3.5-turbo`, leaving ~2000 tokens for conversation history and completion.\n",
    "\n",
    "With this ~2000 token limit we may want to include *five* snippets of relevant information, meaning each snippet can be no more than **400** token long.\n",
    "\n",
    "To create these snippets we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of snippets we also need a *length function*. This is a function that consumes text, counts the number of tokens within the text (after tokenization using the `gpt-3.5-turbo` tokenizer), and returns that number. We define it like so:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the length function defined we can initialize our `RecursiveCharacterTextSplitter` object like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=400,\n",
    "    chunk_overlap=20,  # number of tokens overlap between chunks\n",
    "    length_function=tiktoken_len,\n",
    "    separators=['\\n\\n', '\\n', ' ', '']\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we split the text for a document like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = text_splitter.split_text(docs[5].page_content)\n",
    "len(chunks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tiktoken_len(chunks[0]), tiktoken_len(chunks[1])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For `docs[5]` we created `2` chunks of token length `346` and `247`.\n",
    "\n",
    "This is for a single document, we need to do this over all of our documents. While we iterate through the docs to create these chunks we will reformat them into a format that looks like:\n",
    "\n",
    "```json\n",
    "[\n",
    "    {\n",
    "        \"id\": \"abc-0\",\n",
    "        \"text\": \"some important document text\",\n",
    "        \"source\": \"https://langchain.readthedocs.io/en/latest/glossary.html\"\n",
    "    },\n",
    "    {\n",
    "        \"id\": \"abc-1\",\n",
    "        \"text\": \"the next chunk of important document text\",\n",
    "        \"source\": \"https://langchain.readthedocs.io/en/latest/glossary.html\"\n",
    "    }\n",
    "    ...\n",
    "]\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `\"id\"` will be created based on the URL of the text + it's chunk number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "m = hashlib.md5()  # this will convert URL into unique ID\n",
    "\n",
    "url = docs[5].metadata['source'].replace('rtdocs/', 'https://')\n",
    "print(url)\n",
    "\n",
    "# convert URL to unique ID\n",
    "m.update(url.encode('utf-8'))\n",
    "uid = m.hexdigest()[:12]\n",
    "print(uid)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then use the `uid` alongside chunk number and actual `url` to create the format needed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "    {\n",
    "        'id': f'{uid}-{i}',\n",
    "        'text': chunk,\n",
    "        'source': url\n",
    "    } for i, chunk in enumerate(chunks)\n",
    "]\n",
    "data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we repeat the same logic across our full dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm\n",
    "\n",
    "documents = []\n",
    "\n",
    "for doc in tqdm(docs):\n",
    "    url = doc.metadata['source'].replace('rtdocs/', 'https://')\n",
    "    m.update(url.encode('utf-8'))\n",
    "    uid = m.hexdigest()[:12]\n",
    "    chunks = text_splitter.split_text(doc.page_content)\n",
    "    for i, chunk in enumerate(chunks):\n",
    "        documents.append({\n",
    "            'id': f'{uid}-{i}',\n",
    "            'text': chunk,\n",
    "            'source': url\n",
    "        })\n",
    "\n",
    "len(documents)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're now left with `2201` documents. We can save them to a JSON lines (`.jsonl`) file like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open('train.jsonl', 'w') as f:\n",
    "    for doc in documents:\n",
    "        f.write(json.dumps(doc) + '\\n')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To load the data from file we'd write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = []\n",
    "\n",
    "with open('train.jsonl', 'r') as f:\n",
    "    for line in f:\n",
    "        documents.append(json.loads(line))\n",
    "\n",
    "len(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Sharing the Dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've now created our dataset and you can go ahead and use it in any way you like. However, if you'd like to share the dataset, or store it somewhere that you can get easy access to later — we can use [Hugging Face Datasets Hub](https://huggingface.co/datasets).\n",
    "\n",
    "To begin we first need to create an account by clicking the **Sign Up** button at [huggingface.co](https://huggingface.co/). Once done we click our profile button in the same location > click **New Dataset** > give it a name like *\"langchain-docs\"* > set the dataset to **Public** or **Private** > click **Create dataset**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
